{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Exploration\n",
    "\n",
    "In this notebook, we'll take an opportunity to explore the data we'll be using.\n",
    "Data exploration and understanding is a fundamental step in Machine Learning.\n",
    "\n",
    "In this lab, the data we'll be looking at is the MNIST data set. It consists of 60,000 examples of handwritten digits, with a label corresponding to what the intended digit is. It's worth mentioning that in ML, 60,000 examples is considered a fairly small dataset. Creating an AI to label handwritten digits from the MNIST dataset is a very common, almost \"hello world\" style of AI program\n",
    "\n",
    "It's helpful to know a few characteristics about the dataset in this course in particular. Specifically how it's laid out. In the MNIST dataset we have provided we have done some preprocessing of the data. The data you'll be provided is in CSV format, with the first element corresponding to the actual value, and the rest of the elements corresponding to a 28x28 pixel image of the digit. That 28x28 pixels is flattened in the dataset into 784 individual pixel elements. Each pixel element is in the range of \\[0,255\\] which corresponds to a greyscale. In the case of a color image, each element might be a triple which would correspond to an (R,G,B) value.\n",
    "\n",
    "This dataset will be used to train the model to recognize the relationship between the pixel values of the 28x28 image, and the digit that we tell the model that the pixel values correspond to. The goal of the neural network that you develop over the course of this lab will be to take in an image of a digit, and correctly classify it by outputting the correct digit. No neural network is going to be perfect, but we hope to achieve high accuracy.\n",
    "\n",
    "Accuracy in this lab is defined specifically as the number of correct predictions divided by the total number of predictions.\n",
    "\n",
    "### $ Accuracy = {Correct\\_Predictions \\over Total\\_Predictions} $\n",
    "\n",
    "Note that this metric isn't always the optimal metric for every situation, and some ML systems will need different metrics, but for the purposes of this course, it will work well enough\n",
    "\n",
    "At the end of this notebook, we'll really understand what our data looks like, and that will help us when constructing the AI in the future notebook\n",
    "\n",
    "To start, let's load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-03-15T19:09:25.145253Z",
     "start_time": "2021-03-15T19:09:21.425920Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: boto3==1.17.11 in /opt/app-root/lib/python3.8/site-packages (from -r requirements.txt (line 1)) (1.17.11)\n",
      "Requirement already satisfied: pandas==1.2.3 in /opt/app-root/lib/python3.8/site-packages (from -r requirements.txt (line 2)) (1.2.3)\n",
      "Requirement already satisfied: h5py==2.10.0 in /opt/app-root/lib/python3.8/site-packages (from -r requirements.txt (line 3)) (2.10.0)\n",
      "Requirement already satisfied: keras==2.4.* in /opt/app-root/lib/python3.8/site-packages (from -r requirements.txt (line 4)) (2.4.3)\n",
      "Requirement already satisfied: requests in /opt/app-root/lib/python3.8/site-packages (from -r requirements.txt (line 5)) (2.25.1)\n",
      "Requirement already satisfied: Pillow in /opt/app-root/lib/python3.8/site-packages (from -r requirements.txt (line 6)) (8.2.0)\n",
      "Requirement already satisfied: botocore<1.21.0,>=1.20.11 in /opt/app-root/lib/python3.8/site-packages (from boto3==1.17.11->-r requirements.txt (line 1)) (1.20.101)\n",
      "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/app-root/lib/python3.8/site-packages (from boto3==1.17.11->-r requirements.txt (line 1)) (0.3.7)\n",
      "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/app-root/lib/python3.8/site-packages (from boto3==1.17.11->-r requirements.txt (line 1)) (0.10.0)\n",
      "Requirement already satisfied: numpy>=1.16.5 in /opt/app-root/lib/python3.8/site-packages (from pandas==1.2.3->-r requirements.txt (line 2)) (1.21.0)\n",
      "Requirement already satisfied: pytz>=2017.3 in /opt/app-root/lib/python3.8/site-packages (from pandas==1.2.3->-r requirements.txt (line 2)) (2021.1)\n",
      "Requirement already satisfied: python-dateutil>=2.7.3 in /opt/app-root/lib/python3.8/site-packages (from pandas==1.2.3->-r requirements.txt (line 2)) (2.8.1)\n",
      "Requirement already satisfied: six in /opt/app-root/lib/python3.8/site-packages (from h5py==2.10.0->-r requirements.txt (line 3)) (1.16.0)\n",
      "Requirement already satisfied: scipy>=0.14 in /opt/app-root/lib/python3.8/site-packages (from keras==2.4.*->-r requirements.txt (line 4)) (1.7.1)\n",
      "Requirement already satisfied: pyyaml in /opt/app-root/lib/python3.8/site-packages (from keras==2.4.*->-r requirements.txt (line 4)) (5.4.1)\n",
      "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/app-root/lib/python3.8/site-packages (from botocore<1.21.0,>=1.20.11->boto3==1.17.11->-r requirements.txt (line 1)) (1.26.6)\n",
      "Requirement already satisfied: chardet<5,>=3.0.2 in /opt/app-root/lib/python3.8/site-packages (from requests->-r requirements.txt (line 5)) (4.0.0)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /opt/app-root/lib/python3.8/site-packages (from requests->-r requirements.txt (line 5)) (2021.5.30)\n",
      "Requirement already satisfied: idna<3,>=2.5 in /opt/app-root/lib/python3.8/site-packages (from requests->-r requirements.txt (line 5)) (2.10)\n",
      "\u001b[33mWARNING: You are using pip version 21.1.3; however, version 21.3.1 is available.\n",
      "You should consider upgrading via the '/opt/app-root/bin/python3.8 -m pip install --upgrade pip' command.\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-03-15T19:09:32.764360Z",
     "start_time": "2021-03-15T19:09:32.731819Z"
    }
   },
   "outputs": [],
   "source": [
    "# Some common imports\n",
    "import os\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas is a column-oriented data manipulation library, it's very commonly used, and has a parallel in Spark in their dataframe API\n",
    "\n",
    "One of the convenient things is that Pandas can load many formats of data easily"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-03-15T19:09:38.598016Z",
     "start_time": "2021-03-15T19:09:34.944090Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "train_df: pd.DataFrame = pd.read_csv('Resources/mnist_train.csv', header=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Printing a DataFrame will let us examine the data. Helpfully (or unhelpfully) it will truncate the output to roughly match the size of the terminal. In this case we see pretty much all zeros. This might lead us to wonder if there's any data in there at all, so we'll look at a few more ways to look at the data in the coming cells"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-03-15T19:09:41.034069Z",
     "start_time": "2021-03-15T19:09:41.008338Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>...</th>\n",
       "      <th>775</th>\n",
       "      <th>776</th>\n",
       "      <th>777</th>\n",
       "      <th>778</th>\n",
       "      <th>779</th>\n",
       "      <th>780</th>\n",
       "      <th>781</th>\n",
       "      <th>782</th>\n",
       "      <th>783</th>\n",
       "      <th>784</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59995</th>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59996</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59997</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59998</th>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59999</th>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>60000 rows × 785 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       0    1    2    3    4    5    6    7    8    9    ...  775  776  777  \\\n",
       "0        5    0    0    0    0    0    0    0    0    0  ...    0    0    0   \n",
       "1        0    0    0    0    0    0    0    0    0    0  ...    0    0    0   \n",
       "2        4    0    0    0    0    0    0    0    0    0  ...    0    0    0   \n",
       "3        1    0    0    0    0    0    0    0    0    0  ...    0    0    0   \n",
       "4        9    0    0    0    0    0    0    0    0    0  ...    0    0    0   \n",
       "...    ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   \n",
       "59995    8    0    0    0    0    0    0    0    0    0  ...    0    0    0   \n",
       "59996    3    0    0    0    0    0    0    0    0    0  ...    0    0    0   \n",
       "59997    5    0    0    0    0    0    0    0    0    0  ...    0    0    0   \n",
       "59998    6    0    0    0    0    0    0    0    0    0  ...    0    0    0   \n",
       "59999    8    0    0    0    0    0    0    0    0    0  ...    0    0    0   \n",
       "\n",
       "       778  779  780  781  782  783  784  \n",
       "0        0    0    0    0    0    0    0  \n",
       "1        0    0    0    0    0    0    0  \n",
       "2        0    0    0    0    0    0    0  \n",
       "3        0    0    0    0    0    0    0  \n",
       "4        0    0    0    0    0    0    0  \n",
       "...    ...  ...  ...  ...  ...  ...  ...  \n",
       "59995    0    0    0    0    0    0    0  \n",
       "59996    0    0    0    0    0    0    0  \n",
       "59997    0    0    0    0    0    0    0  \n",
       "59998    0    0    0    0    0    0    0  \n",
       "59999    0    0    0    0    0    0    0  \n",
       "\n",
       "[60000 rows x 785 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can take a look at only a few rows with `head()` or `tail()`. It accepts an argument for how many rows 5 is the default, which is probably what we want\n",
    "\n",
    "Note that when looking at the data, column 0 appears to have our label, while the rest appear to be zeros. If we were to look at a bit more expanded dataframe, we would see that there's some padding around the digits. We'll examine it a bit closer in the next few steps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-03-15T19:09:44.774720Z",
     "start_time": "2021-03-15T19:09:44.761518Z"
    }
   },
   "outputs": [],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also see some basic statistics about the data in the DataFrame with the `describe()` method\n",
    "This is useful as a basic sanity check that the data corresponds to what we expect\n",
    "\n",
    "These statistics aren't particularly useful for this dataset, but it's helpful to see some of them nonetheless\n",
    "We can see that in columns 775-780 there are some non-zero pixels. That at least tells us some images extend that far"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-03-15T19:09:51.410735Z",
     "start_time": "2021-03-15T19:09:48.945910Z"
    }
   },
   "outputs": [],
   "source": [
    "train_df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, this helps, but doesn't really show us the meat of what we need to see to understand the data.\n",
    "\n",
    "Let's now look at a single example. We use the `loc` field to extract a single row and we can look at the field 'values' to get the underlying numpy array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-03-15T19:09:55.459918Z",
     "start_time": "2021-03-15T19:09:55.454633Z"
    }
   },
   "outputs": [],
   "source": [
    "train_df.loc[0].values\n",
    "\n",
    "# That looks better, at least now we're sure there's data in there"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have some idea of what the data in the DataFrame looks like, we could proceed, but it's also good to build just a bit more intuition about it since we can see the values but they don't yet really give us a clear picture of what we're looking at\n",
    "\n",
    "In this case, the best way would be to look at them graphically, excluding the first element which is the label\n",
    "\n",
    "To do that, we bust out our trusty pyplot library. Graphical representations of data often are the most intuitive\n",
    "way to examine data, and especially image data\n",
    "\n",
    "To get a good look at our data, let's take that first example and see that the label does correspond to the image in the way we expect"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-03-15T19:10:00.270075Z",
     "start_time": "2021-03-15T19:10:00.265423Z"
    }
   },
   "outputs": [],
   "source": [
    "selected_feature_row = 0\n",
    "# Get the element at [0, 0] (note that this supports the exact same python slicing that you're used to)\n",
    "label = train_df.loc[selected_feature_row, 0]\n",
    "print(\"Label: %d\" % label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that our label for row 0 is 5\n",
    "Now let's look at the rest of it graphically"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We take the rest of row 0 and call it \"features\", which is an ML term that roughly refers to the characteristics of a piece of data corresponding to the label above.\n",
    "\n",
    "ML is all about generating or predicting a \"label\" given a set of \"features\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-03-15T19:10:04.366315Z",
     "start_time": "2021-03-15T19:10:04.362799Z"
    }
   },
   "outputs": [],
   "source": [
    "features = train_df.loc[selected_feature_row, 1:].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we could look at this as a single dimensional array when we pass it to the next bit, but instead since it's an image, we want to reshape it into the proper 2d shape. That will give us the most clarity about what we want to see. Note that reshape takes a single tuple, rather than a set of integer arguments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-03-15T19:10:07.790758Z",
     "start_time": "2021-03-15T19:10:07.785207Z"
    }
   },
   "outputs": [],
   "source": [
    "reshaped_features = features.reshape((28, 28))\n",
    "# Set this to show the full line width, the notebook will truncate it, but it helps show the data better\n",
    "np.set_printoptions(linewidth=150)\n",
    "reshaped_features\n",
    "\n",
    "# We can see it's got some pretty regular structure here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pyplot.imshow()` takes in an array of values (or n-dimensional array) and prints an image based on their values.\n",
    "\n",
    "It's useful for things like displaying images from pixels, or heatmaps, etc. We're using it for the former, but it has many other uses\n",
    "\n",
    "In our case, we pass in the reshaped 28x28 array from above, and tell it to use the 'greyscale' color map. You can check the documentation for pyplot to learn more about the features and other keywords it supports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-03-15T19:10:14.065794Z",
     "start_time": "2021-03-15T19:10:13.927440Z"
    }
   },
   "outputs": [],
   "source": [
    "print(\"Label: %d\" % label)\n",
    "plt.imshow(reshaped_features, cmap='Greys')\n",
    "\n",
    "# Printing this out, we see a pixellated 5, that tells us pretty much what we want to know about the way\n",
    "# the data looks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the below cell to look at other examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-03-15T19:10:18.356463Z",
     "start_time": "2021-03-15T19:10:18.224983Z"
    }
   },
   "outputs": [],
   "source": [
    "selected_feature_row = np.random.randint(0, 59999)\n",
    "label = train_df.loc[selected_feature_row, 0]\n",
    "features = train_df.loc[selected_feature_row, 1:].values\n",
    "print(\"Label: %d\" % label)\n",
    "plt.imshow(features.reshape((28, 28)), cmap='Greys')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
